-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[aDAG] Allow custom NCCL group for aDAG #47141
Conversation
1fd9edf
to
d7d19b3
Compare
""" | ||
|
||
@abstractmethod | ||
def get_rank(self, actor: ray.actor.ActorHandle) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does it find the next rank correctly? (is there any assumption in the rank order that could be different from communicator that's passed?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to get the next rank? As long as there is a map from Actor to rank it will be sufficient?
a212a01
to
9687899
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good. Some last comments
python/ray/dag/compiled_dag_node.py
Outdated
@@ -776,7 +783,16 @@ def _preprocess(self) -> None: | |||
if None in nccl_actors: | |||
raise ValueError("Driver cannot participate in the NCCL group.") | |||
if nccl_actors and self._nccl_group_id is None: | |||
self._nccl_group_id = _init_nccl_group(nccl_actors) | |||
if self._custom_nccl_group: | |||
self._nccl_group_id = _set_nccl_group( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of having if/else here, is there any way to make the default cupy nccl group to follow the same interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ChannelContext.nccl_groups
has the type Dict[str, "ActorGroup"]
, that is sufficient?
ef33b3e
to
bcaae80
Compare
b1c5bc2
to
3f62bff
Compare
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
7844bc8
to
c19d7a5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have 3 questions; Let's merge it after addressing these.
- Is "custom_nccl_channel" necessary? I wonder if we can just make it overwrite default nccl channel (so remove the term custom nccl channel, and we just allow to overwrite nccl channel using a handle)?
- Can you add tests with torch distributed (similar to vllm)?
- I wonder how the API will be changed with overlap compute/comm.
5e69f60
to
d1aff64
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last comments. Let's merge it after addressing it
Allow custom NCCL group for aDAG so that we can reuse what the user already created. Marking NcclGroupInterface as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability. vLLM prototype: vllm-project/vllm#7568 Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Allow custom NCCL group for aDAG so that we can reuse what the user already created. Marking NcclGroupInterface as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability. vLLM prototype: vllm-project/vllm#7568 Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Allow custom NCCL group for aDAG so that we can reuse what the user already created. Marking NcclGroupInterface as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability. vLLM prototype: vllm-project/vllm#7568 Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Allow custom NCCL group for aDAG so that we can reuse what the user already created. Marking NcclGroupInterface as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability. vLLM prototype: vllm-project/vllm#7568 Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Allow custom NCCL group for aDAG so that we can reuse what the user already created. Marking NcclGroupInterface as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability. vLLM prototype: vllm-project/vllm#7568 Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Allow custom NCCL group for aDAG so that we can reuse what the user already created. Marking NcclGroupInterface as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability. vLLM prototype: vllm-project/vllm#7568 Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Allow custom NCCL group for aDAG so that we can reuse what the user already created. Marking NcclGroupInterface as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability. vLLM prototype: vllm-project/vllm#7568 Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Allow custom NCCL group for aDAG so that we can reuse what the user already created. Marking NcclGroupInterface as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability. vLLM prototype: vllm-project/vllm#7568 Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Allow custom NCCL group for aDAG so that we can reuse what the user already created. Marking NcclGroupInterface as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability. vLLM prototype: vllm-project/vllm#7568 Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Why are these changes needed?
Allow custom NCCL group for aDAG so that we can reuse what the user already created.
Marking
NcclGroupInterface
as DeveloperAPI for now. After validation by using it in vLLM we can change to alpha stability.vLLM prototype: vllm-project/vllm#7568
Related issue number
Closes #45593
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.